
JAMIA Open

Oxford University Press (OUP)

All preprints, ranked by how well they match JAMIA Open's content profile, based on 37 papers previously published here. The average preprint has a 0.06% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Medications that Regulate Gastrointestinal Transit Influence Inpatient Blood Glucose

Momenzadeh, A.; Cranney, C. W.; Choi, S. Y.; Bresee, C.; Tighiouart, M.; Gianchandani, R.; Pevnick, J.; Moore, J.; Meyer, J.

2024-08-02 health informatics 10.1101/2024.07.31.24311287 medRxiv
Top 0.1%
41.6%

Objective: A multitude of factors affect a hospitalized individual's blood glucose (BG), making BG difficult to predict and manage. Beyond medications well established to alter BG, such as beta-blockers, there are likely many medications with undiscovered effects on BG variability. Identification of these medications and of the strength and timing of these relationships has potential to improve glycemic management and patient safety. Materials and Methods: EHR data from 103,871 inpatient encounters over 8 years within a large, urban health system were used to extract over 500 medications, laboratory measurements, and clinical predictors of BG. Feature selection was performed using an optimized Lasso model with repeated 5-fold cross-validation on the 80% training set, followed by a linear mixed regression model to evaluate statistical significance. Significant medication predictors were then evaluated for novelty against a comprehensive adverse drug event database. Results: We found 29 statistically significant features associated with BG; 24 were medications, including 10 medications not previously documented to alter BG. The remaining five factors were Black/African American race, history of type 2 diabetes mellitus, prior BG (mean and last), and creatinine. Discussion: The unexpected medications found to affect BG, including several agents involved in gastrointestinal motility, were supported by available studies. This study may bring to light medications to use with caution in individuals with hyper- or hypoglycemia. Further investigation of these potential candidates is needed to enhance the clinical utility of these findings. Conclusion: This study uniquely identifies medications involved in gastrointestinal transit as predictors of BG that may not be well established and recognized in clinical practice.
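A minimal sketch of the two-stage pipeline the abstract describes, assuming a flat per-encounter extract: Lasso feature selection with repeated 5-fold cross-validation on an 80% training split, then a linear mixed model for significance testing. The file name and columns (bg, encounter_id, the candidate features) are hypothetical stand-ins, not the study's variables.

```python
# Sketch only: two-stage feature selection + significance testing.
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("inpatient_bg.csv")            # hypothetical extract
features = [c for c in df.columns if c not in ("bg", "encounter_id")]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["bg"], train_size=0.8, random_state=0)

# Stage 1: Lasso with repeated 5-fold cross-validation on the 80% split.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
lasso = LassoCV(cv=cv).fit(StandardScaler().fit_transform(X_train), y_train)
selected = [f for f, b in zip(features, lasso.coef_) if b != 0]

# Stage 2: linear mixed model with a random intercept per encounter to
# assess statistical significance of the selected predictors.
formula = "bg ~ " + " + ".join(selected)
mixed = smf.mixedlm(formula, data=df, groups=df["encounter_id"]).fit()
print(mixed.summary())
```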

2
Synthetic Health Data Can Augment Community Research Efforts to Better Inform the Public During Emerging Pandemics

Prasanna, A.; Jing, B.; Plopper, G.; Krasnov Miller, K.; Sanjak, J.; Feng, A.; Prezek, S.; Vidyaprakash, E.; Thovarai, V.; Maier, E.; Bhattacharya, A.; Naaman, L.; Stephens, H.; Watford, S.; Boscardin, W. J.; Johanson, E.; Lienau, A.

2023-12-13 health informatics 10.1101/2023.12.11.23298687 medRxiv
Top 0.1%
36.9%

The COVID-19 pandemic had disproportionate effects on the Veteran population due to the increased prevalence of medical and environmental risk factors. Synthetic electronic health record (EHR) data can help meet the acute need for Veteran population-specific predictive modeling efforts by avoiding the strict barriers to access currently present within Veterans Health Administration (VHA) datasets. The U.S. Food and Drug Administration (FDA) and the VHA launched the precisionFDA COVID-19 Risk Factor Modeling Challenge to develop COVID-19 diagnostic and prognostic models; identify Veteran population-specific risk factors; and test the usefulness of synthetic data as a substitute for real data. The use of synthetic data boosted challenge participation by providing a dataset that was accessible to all competitors. Models trained on synthetic data showed performance metrics similar to, but systematically inflated relative to, those of models trained on real data. The important risk factors identified in the synthetic data largely overlapped with those identified from the real data, and both sets of risk factors were validated in the literature. Tradeoffs exist between synthetic data generation approaches based on whether a real EHR dataset is required as input. Synthetic data generated directly from real EHR input will more closely align with the characteristics of the relevant cohort. This work shows that synthetic EHR data will have practical value to the Veterans health research community for the foreseeable future.

3
NHANES-GPT: Large Language Models (LLMs) and the Future of Biostatistics

Titus, A. J.

2023-12-15 health informatics 10.1101/2023.12.13.23299830 medRxiv
Top 0.1%
32.8%

Background: Large Language Models (LLMs) like ChatGPT have significant potential in biomedicine and health, particularly in biostatistics, where they can lower barriers to complex data analysis for novices and experts alike. However, concerns regarding data accuracy and model-generated hallucinations necessitate strategies for independent verification. Objective: This study, using NHANES data as a representative case study, demonstrates how ChatGPT can assist clinicians, students, and trained biostatisticians in conducting analyses and illustrates a method to independently verify the information provided by ChatGPT, addressing concerns about data accuracy. Methods: The study employed ChatGPT to guide the analysis of obesity and diabetes trends in the NHANES dataset from 2005-2006 to 2017-2018. The process included data preparation, logistic regression modeling, and iterative refinement of analyses with confounding variables. Verification of ChatGPT's recommendations was conducted through direct statistical data analysis and cross-referencing with established statistical methodologies. Results: ChatGPT effectively guided the statistical analysis process, simplifying the interpretation of NHANES data. Initial models indicated increasing trends in obesity and diabetes prevalence in the U.S. Adjusted models, controlling for confounders such as age, gender, and socioeconomic status, provided nuanced insights, confirming the general trends but also highlighting the influence of these factors. Conclusions: ChatGPT can facilitate biostatistical analyses in healthcare research, making statistical methods more accessible. The study also underscores the importance of independent verification mechanisms to ensure the accuracy of LLM-assisted analyses. This approach can be pivotal in harnessing the potential of LLMs while maintaining rigorous standards of data accuracy and reliability in biomedical research.
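For readers who want the shape of the analysis, here is a hedged sketch of the kind of crude-versus-adjusted logistic regression the abstract outlines. The pooled file and variable names (diabetes, cycle, age, female, income_ratio) are assumptions for illustration; a faithful NHANES analysis would also incorporate survey weights and design variables.

```python
# Sketch only: crude and confounder-adjusted trend models.
import pandas as pd
import statsmodels.formula.api as smf

nhanes = pd.read_csv("nhanes_2005_2018.csv")   # hypothetical pooled file

crude = smf.logit("diabetes ~ cycle", data=nhanes).fit()
adjusted = smf.logit(
    "diabetes ~ cycle + age + female + income_ratio", data=nhanes).fit()

# An increasing, significant coefficient on `cycle` in both models would
# mirror the rising prevalence trend reported above.
print(crude.params["cycle"], adjusted.params["cycle"])
```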

4
Clinical encounter heterogeneity and methods for resolving in networked EHR data: A study from N3C and RECOVER programs

Leese, P. J.; Anand, A.; Girvin, A.; Bennett, T.; Hajagos, J.; Patel, S.; Yoo, J.; Pfaff, E.; Moffitt, R.

2022-10-17 health informatics 10.1101/2022.10.14.22281106 medRxiv
Top 0.1%
26.0%

OBJECTIVE: Clinical encounter data are heterogeneous and vary greatly from institution to institution. These problems of variance affect the interpretability and usability of clinical encounter data for analysis, and are magnified when multi-site electronic health record data are networked together. This paper presents a novel, generalizable method for resolving encounter heterogeneity for analysis by combining related atomic encounters into composite macrovisits. MATERIALS AND METHODS: Encounters were composed of data from 75 partner sites harmonized to a common data model as part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, a project of the National COVID Cohort Collaborative (N3C). Summary statistics were computed for overall and site-level data to assess issues and identify modifications. Two algorithms were developed to refine atomic encounters into cleaner, analyzable longitudinal clinical visits. RESULTS: Atomic inpatient encounter data were found to be widely disparate between sites in terms of length-of-stay and numbers of OMOP CDM measurements per encounter. After aggregating encounters to macrovisits, length-of-stay (LOS) and measurement variance decreased. A subsequent algorithm to identify hospitalized macrovisits further reduced data variability. DISCUSSION: Encounters are a complex and heterogeneous component of EHR data, and native data issues are not addressed by existing methods. These types of complex and poorly studied issues contribute to the difficulty of deriving value from EHR data, and these types of foundational, large-scale explorations and developments are necessary to realize the full potential of modern real-world data. CONCLUSION: This paper presents method developments to manipulate and resolve EHR encounter data issues in a generalizable way as a foundation for future research and analysis.
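One plausible reading of "combining related atomic encounters into composite macrovisits" is classic interval merging: encounters whose date ranges overlap or nearly touch are rolled into one composite visit. The sketch below illustrates that idea under a hypothetical one-day gap rule; it is not the published N3C algorithm.

```python
# Sketch only: merge a patient's atomic encounters into macrovisits.
from datetime import date, timedelta

def build_macrovisits(encounters, gap_days=1):
    """encounters: list of (start_date, end_date) for one patient."""
    merged = []
    for start, end in sorted(encounters):
        if merged and start <= merged[-1][1] + timedelta(days=gap_days):
            merged[-1][1] = max(merged[-1][1], end)   # extend the macrovisit
        else:
            merged.append([start, end])               # start a new one
    return [tuple(m) for m in merged]

visits = [(date(2021, 3, 1), date(2021, 3, 4)),
          (date(2021, 3, 4), date(2021, 3, 9)),       # contiguous: merged
          (date(2021, 5, 2), date(2021, 5, 2))]       # isolated: kept apart
print(build_macrovisits(visits))
```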

5
MENDS-on-FHIR: Leveraging the OMOP common data model and FHIR standards for national chronic disease surveillance

Essaid, S.; Andre, J.; Brooks, I. M.; Hohman, K. H.; Hull, M.; Jackson, S. L.; Kahn, M. G.; Kraus, E. M.; Mandadi, N.; Martinez, A. K.; Mui, J. Y.; Zambarano, B.; Soares, A.

2023-08-15 health informatics 10.1101/2023.08.09.23293900 medRxiv
Top 0.1%
25.9%

Objective: The Multi-State EHR-Based Network for Disease Surveillance (MENDS) is a population-based chronic disease surveillance distributed data network that uses institution-specific extraction-transformation-load (ETL) routines. MENDS-on-FHIR examined the use of Health Level Seven's Fast Healthcare Interoperability Resources (HL7® FHIR®) and US Core Implementation Guide (US Core IG) compliant resources derived from the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) to create a standards-based ETL pipeline. Materials and Methods: The input data source was a research data warehouse containing clinical and administrative data in OMOP CDM Version 5.3 format. OMOP-to-FHIR transformations, using a unique JavaScript Object Notation (JSON)-to-JSON transformation language called Whistle, created FHIR R4 V4.0.1/US Core IG V4.0.0 conformant resources that were stored in a local FHIR server. A REST-based Bulk FHIR $export request extracted FHIR resources to populate a local MENDS database. Results: Eleven OMOP tables were used to create 10 FHIR/US Core compliant resource types. A total of 1.13 trillion resources were extracted and inserted into the MENDS repository. A very low rate of non-compliant resources was observed. Discussion: OMOP-to-FHIR transformation results passed validation with less than a 1% non-compliance rate. These standards-compliant FHIR resources provided the standardized data elements required by the MENDS surveillance use case. The Bulk FHIR application programming interface (API) enabled population-level data exchange using interoperable FHIR resources. The OMOP-to-FHIR transformation pipeline creates a FHIR interface for accessing OMOP data. Conclusion: MENDS-on-FHIR successfully replaced custom ETL with standards-based interoperable FHIR resources using Bulk FHIR. The OMOP-to-FHIR transformations provide an alternative mechanism for sharing OMOP data. Lay Abstract: Many chronic conditions, such as hypertension, obesity, and diabetes, are becoming more prevalent, especially in high-risk individuals, such as minorities and low-income patients. Public health surveillance networks measure the presence of specific conditions repeatedly over time, seeking to detect changes in the prevalence of a disease condition so that public health officials can implement new early-prevention programs or evaluate the impact of an existing prevention program. Data stored in electronic health records (EHRs) could be used to measure the presence of health conditions, but significant technical barriers make current methods for data extraction laborious and costly. HL7 Bulk FHIR is a new data standard that is required to be available in all commercial EHR systems in the United States. We examined the use of Bulk FHIR to provide EHR data to an existing public health surveillance network called MENDS. We found that HL7 Bulk FHIR can provide the necessary data elements for MENDS in a standardized format. Using HL7 Bulk FHIR could significantly reduce barriers to data for public health surveillance needs, enabling public health officials to expand the diversity of locations and patient populations being monitored.
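The Bulk FHIR $export flow mentioned above follows a standard asynchronous pattern: kick off the export, poll a status URL, then download NDJSON files. A minimal sketch, assuming a hypothetical server URL and token:

```python
# Sketch only: Bulk FHIR $export kick-off, polling, and download.
import time
import requests

BASE = "https://fhir.example.org/fhir"          # hypothetical FHIR server
HEADERS = {"Accept": "application/fhir+json",
           "Prefer": "respond-async",
           "Authorization": "Bearer <token>"}

kickoff = requests.get(f"{BASE}/$export?_type=Patient,Observation",
                       headers=HEADERS)
status_url = kickoff.headers["Content-Location"]

while True:                                      # poll until the export is ready
    status = requests.get(
        status_url, headers={"Authorization": HEADERS["Authorization"]})
    if status.status_code == 200:
        break
    time.sleep(int(status.headers.get("Retry-After", 10)))

for item in status.json()["output"]:             # one NDJSON file per resource type
    ndjson = requests.get(item["url"]).text
    print(item["type"], len(ndjson.splitlines()), "resources")
```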

6
Automated identification of unstandardized medication data: A scalable and flexible data standardization pipeline using RxNorm on GEMINI multicenter hospital data

Waters, R.; Malecki, S.; Lail, S.; Mak, D.; Saha, S.; Jung, H. Y.; Razak, F.; Verma, A.

2022-02-21 health informatics 10.1101/2022.02.16.22268694 medRxiv
Top 0.1%
23.1%

Objective: Patient data repositories often assemble medication data from multiple sources, necessitating standardization prior to analysis. We implemented and evaluated a medication standardization procedure for use with a wide range of pharmacy data inputs across all drug categories, which supports research queries at multiple levels of granularity. Methods: The GEMINI-RxNorm system automates the use of multiple RxNorm tools in tandem with other datasets to identify drug concepts from pharmacy orders. GEMINI-RxNorm was used to process 2,090,155 pharmacy orders from 245,258 hospitalizations between 2010 and 2017 at 7 hospitals in Ontario, Canada. The GEMINI-RxNorm system matches drug-identifying information from pharmacy data (including free-text fields) to RxNorm concept identifiers. A user interface allows researchers to search for drug terms and returns the relevant original pharmacy data through the matched RxNorm concepts. Users can then manually validate the predicted matches and discard false positives. We designed the system to maximize recall (sensitivity) and enable excellent precision (positive predictive value) with minimal manual validation. We compared the performance of this system to manual coding (by a physician and pharmacist) of 13 medication classes. Results: Manual coding was performed for 1,948,817 pharmacy orders and GEMINI-RxNorm successfully returned 1,941,389 (99.6%) orders. Recall was greater than 98.5% in all 13 drug classes, and the F-measure and precision remained above 90.0% in all drug classes, facilitating efficient manual review to achieve 100.0% precision. GEMINI-RxNorm saved substantial time compared to manual standardization, reducing the time taken to review a pharmacy order row from an estimated 30 seconds to 5 seconds and reducing the number of rows needing review by up to 99.99%. Discussion and Conclusion: GEMINI-RxNorm presents a novel combination of RxNorm tools and other datasets to enable accurate, efficient, flexible, and scalable standardization of pharmacy data. By facilitating efficient minimal manual validation, the GEMINI-RxNorm system can allow researchers to achieve near-perfect accuracy in medication data standardization.
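The internals of GEMINI-RxNorm are not shown here, but the public RxNav REST API exposes the same basic building block: approximate matching of a free-text drug string to RxNorm concept identifiers (RxCUIs). A small sketch:

```python
# Sketch only: approximate-match a messy pharmacy string to RxCUIs
# using the public RxNav REST API.
import requests

def rxnorm_candidates(free_text, max_entries=5):
    resp = requests.get(
        "https://rxnav.nlm.nih.gov/REST/approximateTerm.json",
        params={"term": free_text, "maxEntries": max_entries})
    resp.raise_for_status()
    hits = resp.json().get("approximateGroup", {}).get("candidate", [])
    return [(h["rxcui"], h.get("score")) for h in hits]

# e.g. a free-text pharmacy order string
print(rxnorm_candidates("metoprolol tart 25mg tab"))
```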

7
Development of the Centralized Interactive Phenomics Resource (CIPHER) Standard for Electronic Health Data-Based Phenomics Knowledgebase

Honerlaw, J.; Ho, Y.-L.; Fontin, F.; Gosian, J.; Maripuri, M.; Murray, M.; Sangar, R.; Galloway, A.; Zimolzak, A. J.; Whitbourne, S. B.; Casas, J. P.; Ramoni, R.; Gagnon, D. R.; Cai, T.; Liao, K. P.; Gaziano, J. M.; Muralidhar, S.; Cho, K.

2022-09-15 health informatics 10.1101/2022.09.12.22279792 medRxiv
Top 0.1%
22.9%

The development of phenotypes using electronic health records is a resource-intensive process. Therefore, the cataloging of phenotype algorithm metadata for reuse is critical to accelerate clinical research. The Department of Veterans Affairs Office of Research and Development has developed a phenomics knowledgebase library, CIPHER (Centralized Interactive Phenomics Resource), which improves upon existing phenomics library models to help advance innovation in clinical research through the CIPHER phenotype collection standard. The CIPHER standard was iteratively developed with phenomics experts and has been used to capture over 5,000 phenotypes. We describe the development of the CIPHER standard for phenotype metadata collection, its current application to the largest healthcare system in the United States, and the future expansion of the CIPHER knowledgebase as a public resource for phenotyping.

8
Quantifying the severity of patient safety events via statistical natural language processing

Bhadra, S.; Fong, A.; Sengupta, S.

2025-12-27 health informatics 10.64898/2025.12.22.25342876 medRxiv
Top 0.1%
22.9%

Medical errors are one of the leading causes of death in the United States. Several public databases have been built to record patient safety events across healthcare systems to better understand and improve safety hazards. These reports typically include both structured fields (e.g., event type, device, manufacturer) and unstructured data elements (a free-text narrative of what happened). The structured fields are usually restricted to a limited number of categories, whereas the unstructured fields allow the reporter to freely describe the event details. Thus, analyzing the unstructured text, rather than the structured fields, can reveal rich insights that can help improve patient safety. However, manual analysis of these databases is impractical due to their large size and the inherent subjectivity of manual interpretation. Therefore, we need new statistical algorithms to automate this process. In this paper, we develop a novel statistical technique to predict the severity level of a patient safety event based on its free-text description. Using NLP techniques, we first express the raw event descriptions as numeric feature vectors and then use statistical techniques to model the severity of the events based on the feature vectors. We consider and compare three statistical approaches: multiclass (one-shot), ordinal, and hierarchical (two-step) models. To illustrate the proposed method, we analyzed a large text corpus of more than 7.7 million patient safety reports from the FDA's MAUDE (Manufacturer and User Facility Device Experience) database. The proposed techniques correctly predicted the reported outcome of the events with above 94% accuracy. Furthermore, our techniques helped identify critical terms/phrases and provide a continuous-scale harm score, which can be more useful than a discrete severity level. Inspecting the misclassified reports, we discovered some likely occurrences of mislabeled reports, which were correctly classified by our proposed approach.
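A hedged sketch of the "hierarchical (two-step)" option the abstract compares: a binary harm/no-harm classifier over TF-IDF features, then a second model grading severity among harmful reports. The input file, columns, and the use of logistic regression are illustrative assumptions, not the authors' exact models.

```python
# Sketch only: two-step severity modelling over free-text narratives.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = pd.read_csv("maude_sample.csv")       # hypothetical: text, severity
harmful = reports["severity"] > 0               # step-1 target: any harm

step1 = make_pipeline(TfidfVectorizer(max_features=50_000),
                      LogisticRegression(max_iter=1000))
step1.fit(reports["text"], harmful)

harm_only = reports[harmful]                    # step 2: grade severity
step2 = make_pipeline(TfidfVectorizer(max_features=50_000),
                      LogisticRegression(max_iter=1000))
step2.fit(harm_only["text"], harm_only["severity"])

# The predicted probability of harm can double as a continuous "harm
# score", echoing the continuous-scale output the authors describe.
print(step1.predict_proba(["device fractured during procedure"])[:, 1])
```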

9
Assessing the Quality of Electronic Health Record Data and the Claims Linked Data for Target Trial Emulation Studies

Lee, Y. A.; Lu, Y.; Morris, E. J.; He, X.; Winterstein, A. G.; Henriksen, C.; Bian, J.; Guo, J.

2025-12-29 health informatics 10.64898/2025.12.22.25342844 medRxiv
Top 0.1%
22.9%

Objectives: To evaluate whether an EHR cohort, alone and linked to Medicare claims, has sufficient data quality to support the design elements required for target trial emulation, using type 2 diabetes (T2D) as a case example. Materials and Methods: We constructed annual University of Florida Health EHR-Medicare linked cohorts of patients ≥65 years with T2D from 2013 to 2020. Using Medicare claims as the reference, we assessed EHR data quality for target trial emulation-relevant elements across completeness, accuracy, plausibility, and concordance, spanning target trial components (eligibility, exposure/new-user ascertainment, baseline covariates, outcomes, and follow-up). Data quality was compared across EHR-only, claims-only, and EHR-claims linked data. Results: The mean annual EHR-Medicare linked cohort included 12,895 patients (mean age 74.9 years; 58.0% female). Demographics were complete and highly accurate. In the EHR-only cohort, completeness ranged from 34.1% to 78.4% for conditions and from 53.7% to 63.4% for glucose-lowering drugs (GLDs). Accuracy was high for prevalent conditions and GLD use but low for incident measures. Plausible values were common (>98.5%), and HbA1c-T2D concordance was strong (98.6%). Linking EHR and claims substantially improved completeness and accuracy, especially for encounters, mortality, incident diagnoses, and medications. Discussion: The linked dataset addressed major limitations of EHR-only data and provided enhanced granularity compared to claims alone, offering a comprehensive resource for real-world target trial emulation research. Conclusion: EHRs offer valuable clinical details but face data quality challenges. Robust quality assurance strategies and linkage with external data are essential to strengthen real-world evidence and support target trial emulation. Lay Summary: We evaluated whether a University of Florida Health electronic health record (EHR) cohort (alone and when linked to Medicare claims) has sufficient data quality to support "target trial emulation," a common approach for using real-world data to study medication effects when randomized trials are not feasible. We studied adults aged 65 years and older with type 2 diabetes from 2013-2020 and assessed four practical dimensions of data quality: completeness (how often key information is captured), accuracy (agreement with Medicare for billing-derived elements), plausibility (whether recorded values are clinically reasonable), and concordance (internal consistency between related EHR elements). Demographic fields were highly complete and accurate, and most lab and vital sign values were biologically plausible, supporting the reliability of core EHR clinical measurements. However, the EHR alone missed a substantial share of encounters, deaths, incident diagnoses, and medication initiation events that appeared in Medicare, reflecting care received outside a single health system. Linking EHR with Medicare substantially improved capture of these cross-setting events while preserving EHR-only clinical details (e.g., HbA1c and BMI), yielding a more robust dataset for real-world target trial emulation research.
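As an illustration of one data-quality dimension described above, here is a sketch of computing completeness of a condition in the EHR against claims as the reference; the file and the precomputed patient-level flags are hypothetical:

```python
# Sketch only: completeness of a T2D flag in EHR data vs. claims reference.
import pandas as pd

cohort = pd.read_parquet("linked_cohort.parquet")   # hypothetical linked cohort

reference = cohort["has_t2d_claims"]                # Medicare claims as reference
captured = cohort["has_t2d_ehr"] & reference        # also captured in the EHR

completeness = captured.sum() / reference.sum()
print(f"EHR-only completeness for T2D: {completeness:.1%}")
```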

10
Deep phenotyping obesity using EHR data: Promise, Challenges, and Future Directions

Ruan, X.; Lu, S.; Wang, L.; Wen, A.; Murali, S. B.; Liu, H.

2024-12-08 health informatics 10.1101/2024.12.06.24318608 medRxiv
Top 0.1%
22.8%

Obesity affects approximately 34% of adults and 15-20% of children and adolescents in the U.S., and poses significant economic and psychosocial burdens. Due to the multifaceted nature of obesity, patient responses to any single anti-obesity medication (AOM) currently vary significantly, highlighting the need for developing approaches to obesity deep phenotyping and associated precision medicine. While recent advances in classical phenotyping-guided pharmacotherapies have shown clinical value, they are less embraced by healthcare providers within the precision medicine framework, primarily due to their operational complexity and lack of granularity. From this perspective, several recent review articles have highlighted the importance of obesity deep phenotyping for personalized precision medicine. In view of the established role of the electronic health record (EHR) as an important data source for clinical phenotyping, we offer an in-depth analysis of the data elements commonly available from obesity patients prior to pharmacotherapy. We also experimented with a multi-modal longitudinal deep autoencoder to explore the feasibility, data requirements, clustering patterns, and challenges associated with EHR-based obesity deep phenotyping. Our analysis indicates at least nine clusters, among which five have distinct, explainable clinical relevance. Further research within larger independent cohorts is warranted to validate reproducibility and to uncover more detailed substructures and corresponding treatment responses. Background: Obesity affects approximately 40% of adults and 15-20% of children and adolescents in the U.S., and poses significant economic and psychosocial burdens. Currently, patient responses to any single anti-obesity medication (AOM) vary significantly, making obesity deep phenotyping and associated precision medicine important targets of investigation. Objective: To evaluate the potential of the EHR as a primary data source for obesity deep phenotyping, we conduct an in-depth analysis of the data elements and quality available from obesity patients prior to pharmacotherapy, and apply a multi-modal longitudinal deep autoencoder to investigate the feasibility, data requirements, clustering patterns, and challenges associated with EHR-based obesity deep phenotyping. Methods: We analyzed 53,688 pre-AOM periods from 32,969 patients with obesity or overweight who underwent medium- to long-term AOM treatment. A total of 92 lab and vital measurements, along with 79 ICD-derived Clinical Classifications Software (CCS) codes recorded within one year prior to AOM treatment, were used to train a gated recurrent unit with decay-based longitudinal autoencoder (GRU-D-AE) to generate dense embeddings for each pre-AOM record. Principal component analysis (PCA) and Gaussian mixture modeling (GMM) were applied to identify clusters. Results: Our analysis identified at least nine clusters, with five exhibiting distinct and explainable clinical relevance. Certain clusters show characteristics overlapping with phenotypes from traditional phenotyping strategies. Results from multiple training folds demonstrated stable clustering patterns in two-dimensional space and reproducible clinical significance. However, challenges persist regarding the stability of missing data imputation across folds, maintaining consistency in input features, and effectively visualizing complex diseases in low-dimensional spaces. Conclusion: In this proof-of-concept study, we demonstrated that longitudinal EHR data are a valuable resource for deep phenotyping the pre-AOM period at the per-patient-visit level. Our analysis revealed the presence of clusters with distinct clinical significance, which could have implications for AOM treatment options. Further research using larger, independent cohorts is necessary to validate the reproducibility and clinical relevance of these clusters, and to uncover more detailed substructures and corresponding AOM treatment responses.
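A sketch of the clustering stage named in the Methods, with a placeholder array standing in for the GRU-D-AE embeddings: PCA for dimensionality reduction, then a Gaussian mixture model, with the component count chosen by BIC. The BIC-based selection is an illustrative choice, not necessarily the authors':

```python
# Sketch only: PCA + GMM clustering over autoencoder embeddings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

embeddings = np.load("pre_aom_embeddings.npy")  # hypothetical GRU-D-AE output

coords = PCA(n_components=2, random_state=0).fit_transform(embeddings)

# Pick the number of mixture components by BIC; the paper reports at
# least nine clusters in two-dimensional space.
bics = {k: GaussianMixture(n_components=k, random_state=0)
             .fit(coords).bic(coords)
        for k in range(2, 15)}
best_k = min(bics, key=bics.get)
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(coords)
print(best_k, np.bincount(labels))
```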

11
Deep Learning Approach to Parse Eligibility Criteria in Dietary Supplements Clinical Trials Following OMOP Common Data Model

Bompelli, A.; Li, J.; Xu, Y.; Wang, N.; Wang, Y.; Adam, T.; He, Z.; Zhang, R.

2020-09-18 health informatics 10.1101/2020.09.16.20196022 medRxiv
Top 0.1%
22.7%

Dietary supplements (DSs) have been widely used in the U.S. and evaluated in clinical trials as potential interventions for various diseases. However, many clinical trials face challenges in recruiting enough eligible patients in a timely fashion, causing delays or even early termination. Using electronic health records to find eligible patients who meet clinical trial eligibility criteria has been shown to be a promising way to assess recruitment feasibility and accelerate the recruitment process. In this study, we analyzed the eligibility criteria of 100 randomly selected DS clinical trials and identified both computable and non-computable criteria. We mapped annotated entities to the OMOP Common Data Model (CDM), adding novel entities (e.g., DS). We also evaluated a deep learning model (Bi-LSTM-CRF) for extracting these entities on the CLAMP platform, with an average F1 measure of 0.601. This study shows the feasibility of automatically parsing eligibility criteria following the OMOP CDM for future cohort identification.

12
A model to estimate regional demand for COVID-19 related hospitalizations

Ferstad, J. O.; Gu, A. J.; Lee, R. Y.; Thapa, I.; Shin, A. Y.; Salomon, J. A.; Glynn, P.; Shah, N. H.; Milstein, A.; Schulman, K.; Scheinker, D.

2020-03-30 health informatics 10.1101/2020.03.26.20044842 medRxiv
Top 0.1%
22.6%

COVID-19 threatens to overwhelm hospital facilities throughout the United States. We created an interactive, quantitative model that forecasts demand for COVID-19-related hospitalization based on county-level population characteristics, data from the literature on COVID-19, and data from online repositories. Using this information as well as user inputs, the model estimates a time series of demand for intensive care beds and acute care beds, as well as the availability of those beds. The online model is designed to be intuitive and interactive so that local leaders with limited technical or epidemiological expertise may make decisions based on a variety of scenarios. This complements high-level models designed for public consumption and technically sophisticated models designed for use by epidemiologists. The model is actively being used by several academic medical centers and policy makers, and we believe that broader access will continue to aid community and hospital leaders in their response to COVID-19. Link to online model: https://surf.stanford.edu/covid-19-tools/covid-19/
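The core demand arithmetic such a model rests on can be shown in a few lines. This back-of-the-envelope sketch uses Little's law (census ≈ admissions per day × length of stay) with purely illustrative parameter values; the published tool adds county-level inputs, time dynamics, and bed availability:

```python
# Sketch only: steady-state bed census from illustrative scenario inputs.
infected_per_day = 120          # new county infections per day (scenario input)
hosp_rate = 0.10                # fraction of cases hospitalized
icu_rate = 0.025                # fraction of cases needing intensive care
acute_los, icu_los = 7, 10      # average lengths of stay in days

# At steady state, census ~ daily admissions x length of stay.
acute_census = infected_per_day * (hosp_rate - icu_rate) * acute_los
icu_census = infected_per_day * icu_rate * icu_los
print(f"acute beds: {acute_census:.0f}, ICU beds: {icu_census:.0f}")
```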

13
Leveraging informative missing data to learn about acute respiratory distress syndrome and mortality in long-term hospitalized COVID-19 patients throughout the years of the pandemic

Getzen, E. J.; Tan, A. L.; Brat, G.; Omenn, G. S.; Strasser, Z.; The Consortium for Clinical Characterization of COVID-19 by EHR (4CE), ; Long, Q.; Holmes, J. H.; Mowery, D.

2023-12-19 health informatics 10.1101/2023.12.18.23300181 medRxiv
Top 0.1%
22.6%

Electronic health records (EHRs) contain a wealth of information that can be used to further precision health. One particular data element in EHRs that is not only under-utilized but oftentimes unaccounted for is missing data. However, missingness can provide valuable information about comorbidities and best practices for monitoring patients, which could save lives and reduce burden on the healthcare system. We characterize patterns of missing data in laboratory measurements collected at the University of Pennsylvania Hospital System from long-term COVID-19 patients and focus on the changes in these patterns between 2020 and 2021. We investigate how these patterns are associated with comorbidities such as acute respiratory distress syndrome (ARDS), and with 90-day mortality in ARDS patients. This work demonstrates how knowledge and experience can change the way clinicians and hospitals manage a novel disease. It can also provide insight into best practices for patient monitoring to improve outcomes.
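A sketch of treating missingness itself as a signal, as the abstract suggests: indicator variables for whether a lab was ever measured, tested for association with an outcome. The file and column names (ferritin, d_dimer, ards, year) are hypothetical:

```python
# Sketch only: missingness indicators as predictors of an outcome.
import pandas as pd
import statsmodels.formula.api as smf

labs = pd.read_csv("covid_labs_wide.csv")       # one row per patient stay

labs["ferritin_missing"] = labs["ferritin"].isna().astype(int)
labs["ddimer_missing"] = labs["d_dimer"].isna().astype(int)

# Does *not measuring* a lab associate with ARDS, adjusting for year?
model = smf.logit("ards ~ ferritin_missing + ddimer_missing + year",
                  data=labs).fit()
print(model.summary())
```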

14
Development and evaluation of a scalable alternative to chart review for phenotype case adjudication using standardized structured data from electronic health records

Ostropolets, A.; Hripcsak, G.; Husain, S. A.; Richter, L. R.; Spotnitz, M.; Elhussein, A.; Ryan, P. B.

2022-12-28 health informatics 10.1101/2022.12.27.22283944 medRxiv
Top 0.1%
22.6%

Objective: Chart review, the current gold standard for phenotype evaluation, cannot support observational research at scale. It is expensive, time-consuming, and variable. We aimed to evaluate the ability of structured data to support efficient patient status ascertainment and to develop a standardized and scalable alternative to chart review. Methods: We developed the Knowledge-Enhanced Electronic Patient Profile Review system (KEEPER), which extracts a patient's structured data elements relevant to a given phenotype and presents them in a standardized fashion that follows clinical reasoning principles. We evaluated its performance compared to manual chart review for four conditions (type 1 diabetes, acute appendicitis, end-stage renal disease, and chronic obstructive lung disease) using a randomized two-period, two-sequence crossover design. Inter-method agreement, inter-rater agreement, accuracy, and review duration were measured. Results: Ascertaining patient status with KEEPER was twice as fast as manual chart review. 88.1% of the patients were classified concordantly using the full chart and KEEPER, but agreement varied depending on the condition. Pairs of clinicians agreed on the classification of patient status in 91.2% of the cases when using KEEPER, compared to 76.3% when using the full chart. Patient classification aligned with the gold standard in 88.1% and 86.9% of the cases, respectively. Conclusion: This proof-of-concept study demonstrated that structured data can be used for efficient patient ascertainment if they are limited to only the relevant subset and organized according to clinical reasoning principles. A system that implements these principles can achieve similar accuracy and higher inter-rater reliability compared to chart review in a fraction of the time.

15
Cohort Identification Using Semantic Web Technologies: Triplestores as Engines for Complex Computable Phenotyping

Pfaff, E.; Bradford, R.; Clark, M.; Balhoff, J. P.; Wang, R.; Preisser, J. S.; Walters, K.; Nielsen, M. E.

2021-12-05 health informatics 10.1101/2021.12.02.21267186 medRxiv
Top 0.1%
22.4%

Background: Computable phenotypes are increasingly important tools for patient cohort identification. As part of a study of the risk of chronic opioid use after surgery, we used a Resource Description Framework (RDF) triplestore as our computable phenotyping platform, hypothesizing that the unique affordances of triplestores may aid in making complex computable phenotypes more interoperable and reproducible than traditional relational database queries. To identify and model risk for new chronic opioid users post-surgery, we loaded several heterogeneous data sources into a Blazegraph triplestore: (1) electronic health record data; (2) claims data; (3) American Community Survey data; and (4) Centers for Disease Control Social Vulnerability Index, opioid prescription rate, and drug poisoning rate data. We then ran a series of queries to execute each of the rules in our "new chronic opioid user" phenotype definition to ultimately arrive at our qualifying cohort. Results: Of the 4,163 patients in the denominator, our computable phenotype identified 248 patients as new chronic opioid users after their index surgical procedure. After validation against charts, 228 of the 248 were revealed to be true positive cases, giving our phenotype a PPV of 0.92. Conclusion: We successfully used the triplestore to execute the new chronic opioid user phenotype logic, and in doing so noted some advantages of the triplestore in terms of schemalessness, interoperability, and reproducibility. Future work will use the triplestore to create the planned risk model and to leverage additional links with ontologies and ontological reasoning.
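A hedged sketch of executing one phenotype rule against a Blazegraph SPARQL endpoint with SPARQLWrapper. The endpoint path and the graph predicates are illustrative; the study's actual data model is not reproduced here:

```python
# Sketch only: run one phenotype rule as a SPARQL query.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:9999/blazegraph/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX ex: <http://example.org/ehr#>
SELECT ?patient WHERE {
  ?patient ex:hadProcedure ?surgery .
  ?patient ex:hadOpioidRx  ?rx .
  ?rx ex:daysAfterSurgery ?d .
  FILTER(?d >= 90)   # opioid use persisting well past the index surgery
}
""")
rows = sparql.queryAndConvert()["results"]["bindings"]
print(len(rows), "candidate new chronic opioid users")
```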

16
Phenotype Execution and Modelling Architecture (PhEMA) to support disease surveillance and real-world evidence studies: English sentinel network evaluation.

Jamie, G.; Elson, W.; de Lusignan, S.; Kar, D.; Wimalaratna, R.; Hoang, U.; Meza-Torres, B.; Forbes, A.; Hinton, W.; Anand, S.; Ferreira, F.; Ordonez-Mena, J.; Agrawal, U.; Byford, R.

2023-11-22 health informatics 10.1101/2023.11.21.23298758 medRxiv
Top 0.1%
22.3%

Objective: To evaluate the Phenotype Execution and Modelling Architecture (PhEMA) for expressing sharable phenotypes using Clinical Quality Language (CQL) and intensional SNOMED CT Fast Healthcare Interoperability Resources (FHIR) valuesets, for exemplar chronic disease, sociodemographic risk factor, and surveillance phenotypes. Method: We curated three phenotypes: type 2 diabetes mellitus (T2DM), excessive alcohol use, and incident influenza-like illness (ILI), using CQL to define clinical and administrative logic. We defined our phenotypes with valuesets, using SNOMED's hierarchy and expression constraint language (ECL), and CQL, combining valuesets and adding temporal elements where needed. We compared the count of cases found using PhEMA with our existing approach using convenience datasets. Results: The T2DM phenotype could be defined as two intensionally defined SNOMED valuesets and a CQL script. It increased the prevalence from 7.2% to 7.3%. The excessive alcohol phenotype was defined by valuesets that added qualitative clinical terms to the quantitative conceptual definitions we currently use; this change increased prevalence by 58%, from 1.2% to 1.9%. We created an ILI valueset with SNOMED concepts, adding a temporal element using CQL to differentiate new episodes. This increased the weekly incidence in our convenience sample (weeks 26 to 38) from 0.95 cases to 1.11 cases per 100,000 people. Conclusions: Phenotypes for surveillance and research can be described fully and comprehensibly using CQL and intensional FHIR valuesets. Our use-case phenotypes identified a greater number of cases; whilst this was anticipated for excessive alcohol, it was not for our other variables. This may have been due to our use of the SNOMED CT hierarchy.

17
Automated Extraction of Mortality Information from Publicly Available Sources Using Language Models

Al-Garadi, M. A.; LeNoue-Newton, M.; Matheny, M. E.; McPheeters, M.; Whitaker, J. M.; Deere, J. A.; McLemore, M. F.; Westerman, D.; Khan, M. S.; Hernandez-Munoz, J. J.; Wang, X.; Kuzucan, A.; Desai, R. J.; Reeves, R.

2024-11-01 health informatics 10.1101/2024.10.28.24316027 medRxiv
Top 0.1%
22.3%

Background: Mortality is a critical variable in healthcare research, especially for evaluating medical product safety and effectiveness. However, inconsistencies in the availability and timeliness of death date and cause of death (CoD) information present significant challenges. Conventional sources such as the National Death Index (NDI) and electronic health records (EHRs) often suffer from data lags, missing fields, or incomplete coverage, limiting their utility in time-sensitive or large-scale studies. With the growing use of social media, crowdfunding platforms, and online memorials, publicly available digital content has emerged as a potential supplementary source for mortality surveillance. Despite this potential, accurate tools for extracting mortality information from such unstructured data sources remain underdeveloped. Objective: To develop scalable approaches using natural language processing (NLP) and large language models (LLMs) for the extraction of mortality information from publicly available online data sources, including social media platforms, crowdfunding websites, and online obituaries, and to evaluate their performance across various sources. Methods: Data were collected from public posts on X (formerly Twitter), GoFundMe campaigns, memorial websites (EverLoved.com and TributeArchive.com), and online obituaries from 2015 to 2022, focusing on U.S.-based content relevant to mortality. We developed an NLP pipeline using transformer-based models to extract key mortality information such as decedent names, dates of birth, and dates of death. We then employed a few-shot learning (FSL) approach with LLMs to identify primary and secondary causes of death. Model performance was assessed using precision, recall, F1-score, and accuracy metrics, with human-annotated labels serving as the reference standard for the transformer-based model and a human adjudicator blinded to labeling source serving as the reference standard for the FSL model. Results: The best-performing model obtained a micro-averaged F1-score of 0.88 (95% CI, 0.86-0.90) in extracting mortality information. The FSL-LLM approach demonstrated high accuracy in identifying primary CoD across various online sources. For GoFundMe, the FSL-LLM achieved 95.9% accuracy for primary cause identification, compared to 97.9% for human annotators. In obituaries, FSL-LLM accuracy was 96.5% for primary causes, while human accuracy was 99.0%. For memorial websites, FSL-LLM achieved 98.0% accuracy for primary causes, with human accuracy at 99.5%. Conclusions: This study demonstrates the feasibility of using advanced NLP and LLM techniques to extract mortality data from publicly available online sources. These methods can significantly enhance the timeliness, completeness, and granularity of mortality surveillance, offering a valuable complement to traditional data systems. By enabling earlier detection of mortality signals and improving CoD classification across large populations, this approach may support more responsive public health monitoring and medical product safety assessments. Further work is needed to validate these findings in real-world healthcare settings and facilitate the integration of digital data sources into national public health surveillance systems.
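The few-shot cause-of-death extraction could look roughly like the sketch below, which uses the OpenAI chat API as a stand-in; the study's actual model, prompt wording, and label set are not reproduced here:

```python
# Sketch only: few-shot cause-of-death extraction with a chat LLM.
from openai import OpenAI

client = OpenAI()
FEW_SHOT = [
    {"role": "system",
     "content": "Extract the primary cause of death from the text. "
                "Answer with a single condition, or 'unknown'."},
    # One illustrative in-context example; a real pipeline would use more.
    {"role": "user", "content": "After a long battle with lung cancer, ..."},
    {"role": "assistant", "content": "lung cancer"},
]

def primary_cod(text):
    msgs = FEW_SHOT + [{"role": "user", "content": text}]
    resp = client.chat.completions.create(model="gpt-4o-mini",  # stand-in model
                                          messages=msgs)
    return resp.choices[0].message.content.strip()

print(primary_cod("He passed peacefully following complications of heart failure."))
```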

18
Using Natural Language Processing of Clinical Notes to Supplement Structured Electronic Health Record Data for Phenotyping Smoking and Obesity in a Healthcare System

Yang, J.; Gu, B.; Pillai, H.; Lii, J.; Cronkite, D.; Marsolo, K. A.; Desai, R. J.

2026-01-21 health informatics 10.64898/2026.01.18.26344356 medRxiv
Top 0.1%
22.3%

Purpose: Studies based on electronic health records (EHRs) often rely on structured data, which may incompletely capture important clinical phenotypes documented in EHR notes. The purpose of this study was to assess two natural language processing (NLP) tools for extracting phenotypes from unstructured EHR notes, and to evaluate the added value of integrating NLP-derived phenotypes with structured EHR data at a health system scale. Methods: This retrospective study is based on inpatient and outpatient EHR data from the Mass General Brigham healthcare system between January 1, 2019 and December 31, 2020. Two established rule-based NLP tools were applied to extract smoking and obesity information from 19,215,303 clinical notes of 503,025 patients. NLP performance was evaluated through manual review of stratified samples. Phenotype prevalence was estimated using structured EHR data alone and compared with prevalence estimates obtained by supplementing structured data with NLP-derived features. Results: Both NLP tools exhibited high performance, with accuracy and F1 scores of 0.99 for smoking, and 0.92 and 0.91, respectively, for obesity. The combination of NLP and structured data identified 220,714 patients (43.88%) with smoking, compared with 170,396 patients (33.87%) identified using structured data alone, representing a 29.5% relative increase. For obesity, NLP identified 121,360 patients (24.12%) from EHR notes, and 169,905 patients (33.78%) were documented in structured data; inclusion of NLP-derived features contributed an additional 32,823 patients, corresponding to a 19.3% relative increase. Conclusion: NLP-derived phenotypes from unstructured EHR notes substantially improve patient identification at scale for both smoking and obesity, compared with structured EHR data alone.
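The integration step reduces to a set union over patient identifiers. A minimal sketch, assuming hypothetical ID files for the structured-data and NLP-derived smoking cohorts:

```python
# Sketch only: combine structured and NLP-derived patient cohorts.
structured_smokers = set(open("smoking_structured_ids.txt").read().split())
nlp_smokers = set(open("smoking_nlp_ids.txt").read().split())

combined = structured_smokers | nlp_smokers
added = combined - structured_smokers            # found only via NLP
relative_increase = len(added) / len(structured_smokers)
print(f"{len(combined)} combined; +{relative_increase:.1%} over structured alone")
```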

19
DiabetIA: Building Machine Learning Models for Type 2 Diabetes Complications

Tripp, J.; Santana-Quinteros, D.; Perez-Estrada, R.; Rodriguez-Moran, M. F.; Arcos-Gonzalez, C.; Mercado-Rios, J.; Cristobal-Perez, F.; Hernandez-Martinez, B. R.; Nava-Aguilar, M. A.; Gonzalez-Arroyo, G.; Salazar-Fernandez, E. P.; Quiroz-Armada, P. S.; Cortes-Vieyra, R.; Noriega-Cisneros, R.; Zinzun-Ixta, G.; Maldonado-Pichardo, M. C.; Flores-Alvarez, L. J.; Reyes-Granados, S. C.; Chagolla-Morales, R.; Paredes-Saralegui, J. G.; Flores-Garrido, M.; Garcia-Velazquez, L. M.; Figueroa-Mora, K. M.; Gomez-Garcia, A.; Alvarez-Aguilar, C.; Lopez-Pineda, A.

2023-10-23 health informatics 10.1101/2023.10.22.23297277 medRxiv
Top 0.1%
22.3%

Background: Artificial intelligence (AI) models applied to diabetes mellitus research have grown in recent years, particularly in the field of medical imaging. However, little work has been done exploring real-world data (RWD) sources such as electronic health records (EHRs), mostly due to the lack of reliable public diabetes databases. Yet with more than 500 million patients affected worldwide, complications of this condition have catastrophic consequences. In this manuscript we aim first to extract, clean, and transform a novel diabetes research database, DiabetIA, and second to train machine learning (ML) models to predict diabetic complications. Methods: In this study, we used observational retrospective data from the Mexican Institute for Social Security (IMSS), extracting and de-identifying EHR data for almost 2 million patients seen at primary care facilities. After applying eligibility criteria for this study, we constructed a diabetes complications database. Next, we trained naive Bayes models with various subsets of variables, including an expert-selected model. Results: The DiabetIA database is composed of 136,674 patients (414,770 records and 447 variables), with 33,314 presenting diabetes (24.3%). The most frequent diabetic complications were diabetic foot with 2,537 patients, nephropathy with 1,914 patients, retinopathy with 1,829 patients, and neuropathy with 786 patients. These complications were accurately predicted by the Gaussian naive Bayes models, with an average area under the curve (AUC) of 0.86. Our expert-selected model achieved an average AUC of 0.84 with 21 curated variables. Conclusion: Our study offers the largest longitudinal research database from EHR data in Latin America for research. The DiabetIA database provides a useful resource to estimate the burden of diabetic complications on healthcare systems. Machine learning models can provide accurate estimations of the total cases presented in medical units. For patients and their clinicians, it is imperative to have a way to calculate this risk and start clinical interventions to slow down or prevent the complications of this condition. Brief description: The study centers on establishing the DiabetIA database, a substantial repository encompassing de-identified electronic health records from 136,674 patients sourced from primary care facilities within the Mexican Institute for Social Security (IMSS). Our efforts involved curating, cleansing, and transforming this extensive dataset, and then employing machine learning models to predict diabetic complications with high accuracy.
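A sketch of the modeling step the abstract names: a Gaussian naive Bayes classifier for one complication, evaluated by AUC. The file and the assumption that all features are numeric are illustrative; the DiabetIA variables are not reproduced here.

```python
# Sketch only: Gaussian naive Bayes for one diabetic complication.
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv("diabetia.csv")              # hypothetical numeric extract
X = data.drop(columns=["diabetic_foot"])        # one complication at a time
y = data["diabetic_foot"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```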

20
Sharing and Reusing Computable Phenotype Definitions

Visweswaran, S.; Zhang, L. Y.; Bui, K.; Sadhu, E. M.; Samayamuthu, M. J.; Morris, M. M.

2023-09-18 health informatics 10.1101/2023.09.17.23295681 medRxiv
Top 0.1%
22.0%

Background: A scalable approach for the sharing and reuse of human-readable and computer-executable phenotype definitions can facilitate the reuse of electronic health records for cohort identification and research studies. Description: We developed a tool called Sharephe for the Informatics for Integrating Biology and the Bedside (i2b2) platform. Sharephe consists of a plugin for i2b2 and a cloud-based searchable repository of computable phenotypes, has the functionality to import to and export from the repository, and has the ability to link to supporting metadata. Discussion: The i2b2 platform enables researchers to create, evaluate, and implement phenotypes without knowing complex query languages. In an initial evaluation, two sites on the Evolve to Next-Gen ACT (ENACT) network used Sharephe to successfully create, share, and reuse phenotypes. Conclusion: The combination of a cloud-based repository of computable phenotypes and an i2b2 plugin for accessing the repository enables investigators to store and retrieve phenotypes from anywhere at any time and to collaborate across sites in a research network.